Anatomy of a Resource Management System for HPC Clusters
Abstract
Workstation clusters are often used not only for high-throughput computing in time-sharing mode but also for running complex parallel jobs in space-sharing mode. This poses several difficulties for the resource management system, which must be able to reserve computing resources for exclusive use and also determine an optimal process mapping for a given system topology. On the basis of our CCS software, we describe the anatomy of a modern resource management system. Like Codine, Condor, and LSF, CCS provides mechanisms for user-friendly system access and management of clusters. Unlike them, however, CCS is targeted at the effective support of space-sharing parallel computers and even metacomputers. Among other features, CCS provides a versatile resource description facility, topology-based process mapping, pluggable schedulers, and hooks to metacomputer management.
Similar resources
Management of Virtual Large-scale High-performance Computing Systems
Linux is widely used on high-performance computing (HPC) systems, from commodity clusters to Cray supercomputers (which run the Cray Linux Environment). These platforms primarily differ in their system configuration: some only use SSH to access compute nodes, whereas others employ full resource management systems (e.g., Torque and ALPS on Cray XT systems). Furthermore, the latest improvements i...
JMS: An Open Source Workflow Management System and Web-Based Cluster Front-End for High Performance Computing
Complex computational pipelines are becoming a staple of modern scientific research. Often these pipelines are resource intensive and require days of computing time. In such cases, it makes sense to run them over high performance computing (HPC) clusters where they can take advantage of the aggregated resources of many powerful computers. In addition to this, researchers often want to integrate...
A Comparison of Job Management Systems in Supporting HPC ClusterTools
This paper compares the three most common job management systems and how they work with Sun HPC ClusterTools 3.1. Various aspects such as installation, customization, scheduling, and resource control are discussed. The three chosen systems are Load Sharing Facility (LSF), Portable Batch System (PBS), and COmputing in DIstributed Networked Environment (CODINE)/Global Resource Director (GRD)....
Improving the Eco-Efficiency of High Performance Computing Clusters Using EECluster
As data and supercomputing centres increase their performance to improve service quality and target more ambitious challenges every day, their carbon footprint also continues to grow, and has already reached that of the aviation industry. Also, high power consumption is becoming a significant bottleneck for the expansion of these infrastructures in economic terms due to the unav...
Object Storage: Scalable Bandwidth for HPC Clusters
This paper describes the Object Storage Architecture solution for cost-effective, high-bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system to scale to very large sizes and performance without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Netwo...